ABSTRACT 

A key challenge within the social network literature is the 
problem of network generation - that is, how can we create 
synthetic networks that match characteristics traditionally 
found in most real world networks? Important characteris- 
tics that are present in social networks include a power law 
degree distribution, small diameter and large amounts of 
clustering; however, most current network generators, such 
as the Chung Lu and Kronecker models, largely ignore the 
clustering present in a graph and choose to focus on preserv- 
ing other network statistics, such as the power law distribu- 
tion. Models such as the exponential random graph model 
have a transitivity parameter, but are computationally dif- 
ficult to learn, making scaling to large real world networks 
intractable. 

In this work, we propose an extension to the Chung Lu ran- 
dom graph model, the Transitive Chung Lu (TCL) model, 
which incorporates the notion of a random transitive edge. 
That is, with some probability it will choose to connect to 
a node exactly two hops away, having been introduced to 
a 'friend of a friend'. In all other cases it will follow the 
standard Chung Lu model, selecting a 'random surfer' from 
anywhere in the graph according to the given invariant dis- 
tribution. We prove TCL's expected degree distribution is 
equal to the degree distribution of the original graph, while 
being able to capture the clustering present in the network. 
The single parameter required by our model can be learned 
in seconds on graphs with millions of edges, while networks 
can be generated in time that is linear in the number of 
edges. We demonstrate the performance of TCL on four 
real-world social networks, including an email dataset with 
hundreds of thousands of nodes and millions of edges, showing 
TCL generates graphs that match the degree distribution, 
clustering coefficients and hop plots of the original networks. 

Categories and Subject Descriptors 

G.2.2 [Graph Theory]: Network problems; G.3 [Probability 
and Statistics]: Markov processes 



1. INTRODUCTION 

A challenging problem within the social network commu- 
nity is generating graphs which adhere to certain statistics. 
Due to the prevalence of 'small world' graphs such as Facebook 
and the Internet [16], models which attempt to capture 
properties of small world graphs such as a power law de- 
gree distribution, small diameter and clustering greater than 
randomly present for the sparsity of the network have be- 
come a much-discussed topic in the field [t] p) [3l [Tol [2] p3] . 
The first random graph model, the Erdos-Renyi modelM , 
proposed random connections between nodes in the graph 
where each edge is sampled independently; however, this 
model has a Binomial degree distribution, not power law, 
and generally lacks clustering when generating sparse net- 
works. As a result, multiple attempts have been made to 
develop algorithms that generate graphs with small world 
network properties. 

Exponential Random Graph Models (ERGM) extend the 
Erdos-Renyi model to allow additional statistics of the graph 
as parameters [15]. The typical approach is to model the 
network under the assumption of Markov independence 
throughout the graph - edges are only dependent on other edges that 
share the same node . Using this, ERGMs define an expo- 
nential family of models using various Markov statistics of 
the graph, allowing for the incorporation of a transitivity 
parameter, then maximize the likelihood of the parameters 
given the graph. The algorithms for learning and generating 
ERGMs are resource intensive and intractable for applica- 
tion to networks of more than a few thousand nodes. 

As a result, newer efforts make scalability an explicit goal 
when constructing models and algorithms. Notable exam- 
ples include the Chung-Lu Graph Model (CL) [s] and the 
Kronecker Product Graph Model (KPGM) [t]. CL is also 
an extension of the Erdos-Renyi model, but rather than cre- 
ating a summary statistic based on the degrees, it gener- 
ates a new graph such that the expected degree distribution 
matches the given distribution exactly. In contrast, KPGM 
learns a 2x2 matrix of parameters and lays down edges ac- 
cording to the Kronecker product of the matrix to itself log n 
times. For large graphs this algorithm can learn the param- 
eters defined by the 2x2 matrix in hours and can generate 
large graphs in minutes. 

With CL and KPGM we have scalable algorithms for learn- 
ing and generating graphs with hundreds of thousands of 
nodes and millions of edges. However, in order to achieve 
scalability, a power law degree distribution and small diameter, 
both models have made the decision to ignore clustering 
in their generated graphs. This is not an insignificant con- 
sequence, as a small world network is in part defined by the 
clustering of nodes [16] . While ERGM can potentially learn 
networks with clustering, the complexity of the model makes 
it a poor prospect when considering learning and generating 
graphs with massive size. 

In order to generate sparse networks which can accurately 
capture the degree distribution, small diameter and clus- 
tering, we propose to extend the CL algorithm in multiple 
ways. The first portion of this paper will show how the naive 
fast generation algorithm for the CL model is biased, and 
we develop a correction to this problem. Next, we introduce 
a generalization to the CL model known as the Transitive 
Chung Lu (TCL). To do this, we observe that the CL model 
is a 'random surfer' model, similar to the PageRank random 
walk algorithm^. However, CL always chooses the random 
surfer and has no affinity for nodes along transitive edges. 
In contrast, our TCL model will sometimes choose to fol- 
low these transitive edges and then close a triangle rather 
than selecting a random node according to the surfer. The 
probability of randomly surfing versus closing a triangle is a 
single parameter in our model which we can learn in seconds 
from an observed graph, compared to the hours required to 
learn KPGM. In short, the contributions in our work can be 
summarized as follows: 



• Introduction of a 'random triangle' parameter to the 
CL model 

• A correction to the 'edge collision' problem seen in 
naive fast CL model generation 

• Analysis showing TCL has an expected degree distri- 
bution equal to the original input network's degree dis- 
tribution 

• A learning algorithm for TCLs that runs in seconds for 
graphs with millions of edges 

• A generation algorithm for TCLs which runs on the 
same order as naive fast CL, and faster than KPGM 

• Empirical demonstrations that show the graphs gen- 
erated from TCL match the degree distribution, clus- 
tering coefficient and hop plots of the original graph 
better than fast CL or KPGM 



2. RELATED WORK 

Recently there has been a great deal of work focused on the 
development of generative models for small world and 
scale-free graphs (e.g., [5][16][2][6][15][7][3]). As an example, 
the Chung Lu model is able to generate a network which 
has a provable expected degree distribution equal to the de- 
gree distribution of the original graph. The CL model, like 
many, attempts to define a process which matches a subset 
of features observed in a network. 

The importance of the clustering coefficient has been 
demonstrated by Watts and Strogatz [16]. In particular, they show 
that small world networks (including social networks) are 
characterized by a short path length and large clustering co- 
efficient. One recent algorithm (Seshadri et al [m]) matches 
these statistics by putting together nodes with similar de- 
grees and generating Erdos-Renyi graphs for each group. 
The groups are then tied together. However, this algorithm 
needs a parameter to be set manually to work. Existing 
models that can generate clustering in the network gener- 
ally do not have a training algorithm. 

One method that can model clustering and can learn the as- 
sociated parameter is the Exponential Random Graph Model 
(ERGM) [15]. ERGMs define a probability distribution over 
the set of possible graphs with a log-linear model that uses 
feature counts of local graph properties. However, these 
models are typically hard to train as each update of the 
Fisher scoring function takes O(n^). With real-world net- 
works numbering in the hundreds of thousands if not millions 
of nodes, this makes ERGMs impossible to fit. 

Another method is the Kronecker product graph model (KPGM), 
a scalable algorithm for learning models of large-scale net- 
works that empirically preserves a wide range of global prop- 
erties of interest, such as degree distributions, and path- 
length distributions [?]. Thanks to these characteristics, 
KPGM has been selected as a generation algorithm for the 
Graph 500 Supercomputer Benchmark. 

The KPGM starts with an initial square matrix Θ_1 of size 
b x b, where each cell value is a probability. To generate 
a graph, the algorithm uses k Kronecker multiplications to 
grow until a determined size (obtaining Θ_k, with b^k = N rows 
and columns). Each edge is then independently sampled 
using a Bernoulli distribution with parameter Θ_k(i,j). A 
rough implementation of this algorithm has time O(N^2), but 
improved algorithms can generate a network in O(M log N), 
where M is the number of edges in the network [t]. According 
to [7], the learning time is linear in the number of edges. 
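
The rough O(N^2) procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `kronecker_power` and `sample_kpgm` and the example initiator values are hypothetical, and the faster O(M log N) generator is not shown.

```python
import random


def kronecker_power(theta, k):
    """Return the k-th Kronecker power of a square matrix given as a list of lists."""
    result = [[1.0]]
    for _ in range(k):
        # Row (i, p) and column (j, q) of the product hold result[i][j] * theta[p][q].
        result = [
            [a * b for a in row_r for b in row_t]
            for row_r in result
            for row_t in theta
        ]
    return result


def sample_kpgm(theta, k, rng):
    """Naive sampler: one independent Bernoulli trial per cell of the Kronecker power."""
    probs = kronecker_power(theta, k)
    n = len(probs)
    return [(i, j) for i in range(n) for j in range(n) if rng.random() < probs[i][j]]


if __name__ == "__main__":
    initiator = [[0.9, 0.5], [0.5, 0.1]]  # hypothetical 2x2 initiator matrix
    edges = sample_kpgm(initiator, 3, random.Random(0))
    print(len(edges), "edges sampled on", 2 ** 3, "nodes")
```

Every one of the N^2 cells is visited, which is why this naive version does not scale to large graphs.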



In section 2 we discuss in more depth the ERGM, KPGM 
and CL models, while in section 3 we outline the basis for 
the CL model. Next, we show the fast method used for 
generating graphs in section 4 and our correction to it. In 
section 5 we introduce our modification to the CL model, 
proving the expected degree distribution and demonstrating 
how to learn the transitive probability, while in section 6 we 
analyze the runtimes of our fast CL correction and TCL. In 
section 7 we learn the parameter and generate graphs which 
closely match the original graphs. We end in section 8 with 
conclusions and future directions. 



3. CHUNG-LU MODEL AND INVARIANT 
MC DISTRIBUTION 

Define graph G = (V, E), where V is a set of N vertices, or 
nodes, and E ⊆ V x V is a set of M edges or relationships 
between the vertices. Let A represent the adjacency matrix 
for G where: 

\[ A_{ij} = \begin{cases} 1 & \text{if } E \text{ contains the tuple } (v_i, v_j) \\ 0 & \text{otherwise} \end{cases} \qquad (1) \]

Next, define the diagonal matrix D such that: 

\[ D_{ij} = \begin{cases} \sum_k A_{ik} & \text{if } j = i \\ 0 & \text{otherwise} \end{cases} \qquad (2) \]

The diagonal of matrix D represents the degree of each node, 
where D_ii is the degree of node i. Finally, define the 
transition probability matrix P: 

\[ P = D^{-1} A \qquad (3) \]



This transition probability matrix is the probability of ar- 
riving at any node during a random walk that is uniform 
over the edges. It is important to note that the rows of P 
are normalized: 



\[ \sum_j P_{ij} = 1 \quad \forall i \qquad (4) \]

Algorithm 1 CL(π, N, |E|) 
  E^CL = {} 
  initialize(queue) 
  for iterations do 
    if queue is empty then 
      v_j = pi_sample(π) 
    else 
      v_j = pop(queue) 
    end if 
    v_i = pi_sample(π) 
    if e_ij ∉ E^CL then 
      E^CL = E^CL ∪ e_ij 
    else 
      push(queue, v_i) 
      push(queue, v_j) 
    end if 
  end for 
  return(E^CL) 
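
As a small check of the definitions of A, D and P, the following sketch builds the transition matrix for a toy undirected graph and confirms that each row sums to one. The helper name `transition_matrix` and the example graph are assumptions made for illustration, and the graph is assumed to have no isolated nodes.

```python
from fractions import Fraction


def transition_matrix(adj):
    """Return P = D^{-1} A for an adjacency matrix with no zero-degree rows."""
    degrees = [sum(row) for row in adj]  # diagonal entries D_ii
    return [[Fraction(a, d) for a in row] for row, d in zip(adj, degrees)]


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 plus an extra edge 1 - 3 (hypothetical example).
    adj = [
        [0, 1, 0, 0],
        [1, 0, 1, 1],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
    ]
    for row in transition_matrix(adj):
        print(row, "sum =", sum(row))
```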



3.1 Chung Lu Model 

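The Chung Lu model assigns edges to the graph by independently 
laying edges for each possible edge with probability: 

\[ P(e_{ij}) = \frac{D_{ii} D_{jj}}{2M} \]

It is assumed that D_kk < \sqrt{2M} for all k. The expected degree 
distribution for this graph is simply: 

\[ E\left[D^{CL}_{ii}\right] = \sum_j \frac{D_{ii} D_{jj}}{2M} = D_{ii} \sum_j \frac{D_{jj}}{2M} = D_{ii} \]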
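
The expected-degree identity can be verified exactly on a small degree sequence. The sketch below is illustrative only: the function names are hypothetical, and the sum runs over all j including j = i, mirroring the summation in the text.

```python
from fractions import Fraction


def cl_edge_probability(degrees, i, j):
    """Slow Chung-Lu edge probability D_i * D_j / 2M, where 2M is the degree sum."""
    two_m = sum(degrees)
    return Fraction(degrees[i] * degrees[j], two_m)


def expected_degree(degrees, i):
    """Expected degree of node i: sum of its edge probabilities over all j."""
    return sum(cl_edge_probability(degrees, i, j) for j in range(len(degrees)))


if __name__ == "__main__":
    degrees = [3, 2, 2, 1]  # hypothetical degree sequence, 2M = 8
    for i, d in enumerate(degrees):
        print(i, d, expected_degree(degrees, i))
```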



3.2 Fast Chung Lu Model 

The invariant distribution of a graph is the distribution that, 
when multiplied with the transition probability matrix, 
returns itself: 

\[ \pi P = \pi \]

A possible candidate for such a distribution is defined in 
terms of the degrees of the network, where π(i) = D_ii / 2M: 

\[ \sum_i \pi(i) P_{ij} = \sum_i \frac{D_{ii}}{2M} \frac{A_{ij}}{D_{ii}} = \frac{1}{2M} \sum_i A_{ij} = \frac{D_{jj}}{2M} = \pi(j) \]

If we assume the matrix P is stationary (not changing as the 
random walker steps through the graph) and non-bipartite, 
the π distribution is unique and the walk tends to the stationary 
distribution π as the number of steps tends to infinity [s]. 

In [10], the authors describe a fast edge-laying algorithm 
which runs in O(M). The algorithm proceeds by creating a 
vector of size O(M), then places the ID of each node v_i in 
the vector D_ii times. It is not hard to see that since the sum 
of the degrees equals twice the number of edges in the graph, 
a vector of 2M entries lets each node place its ID exactly 
D_ii times without collision, and without leaving empty space 
in the vector. 

Next, a node ID v_i is drawn from the vector - this can be 
done in O(1) by drawing a uniform random variable between 
1 and 2M, and using offsets to index into the array. The 
next step is to draw another independent vertex v_j from the 
vector and place the edge between the two sampled nodes. 
In a special graph, the regular graph, we can show that the 
probability of an edge existing is exactly the same for the 
fast CL method and the slow. 

Proposition 1. In a regular graph, the probability of an 
edge existing in the Fast Chung-Lu model is the same as the 
probability of an edge existing in the Slow Chung-Lu Model. 

Proof. Let v_i, v_j be two nodes in our network. According 
to the Fast Chung Lu model we will select every node at 
random with replacement, meaning the number of times the 
node v_j will be selected as the first node is D_j, where D is 
the degree for every node in the network, and j is used for 
notation, to indicate the particular node. Since this graph 
is regular, D_i = D_j for all i, j. The probability of an edge being 
placed from v_j to v_i is the sum: 

\[ \frac{D}{2M} + \sum_{k \in V, k \neq i} \frac{D_k}{2M} \frac{D}{2M - D_k} + \dots = \frac{D}{2M} + \frac{2M - D}{2M} \frac{D}{2M - D} + \dots = \frac{2D}{2M} + \dots \qquad (5) \]

The sum continues to D_j. The probability of inserting on 
the d insertion is therefore: 

\[ \sum_{k_1} \frac{D_{k_1}}{2M} \sum_{k_2} \frac{D_{k_2}}{2M - D} \cdots \sum_{k_d} \frac{D_{k_d}}{2M - (d-1)D} \cdot \frac{D}{2M - dD} = \frac{D}{2M} \qquad (6) \]

Figure 1: Fast CL edge probability (y-axis) versus 
Slow CL edge probability (x-axis). In (a) we show 
the true Facebook 2012 Network and in (b) the network 
augmented to force high degree nodes to approach \sqrt{2M} 
degree, and connect to each other. Panels: (a) Original, 
(b) Adjusted. 

Figure 2: Comparison of Basic and Corrected CCDF 
on two datasets. The original method underestimates 
the degree of the high degree nodes. Panels: (a) Epinions, 
(b) Facebook. 

where v_{k_l} ∈ V, k_l ≠ k_{l-1}, ..., k_1, i. Thus each time we place 
an edge from v_j, we place it with probability D / 2M on node 
v_i. After d insertions, the probability of having an edge e_ji 
is then dD / 2M. Since we draw M times for the first node, the 
expected number of draws on v_j is then D / 2, meaning the 
probability of connecting v_j to v_i is D^2 / 4M. If we include the 
opposite direction, we get D^2 / 2M, which is the same as the 
probability in the slow method. □ 

Usually we do not have a regular graph, meaning the breakdown 
between the degrees does not have the convenient cancellation 
of sums like the regular graph. However, for sparse 
graphs we assume the proportion of degrees is close enough 
to one another such that the summations effectively cancel. 
The difference between the two probabilities is illustrated in 
Figure 1. To do this, we show the edge probabilities along 
the x-axis as predicted by the original CL method (D_ii D_jj / 2M) 
versus a simulation of 10,000 networks for the fast edge 
probabilities. The y-axis indicates the proportion of generated 
networks which have the edge (we plot the top 10 degree 
nodes' edges). The dataset we use is a subset of the Purdue 
University Facebook network, a snapshot of the class 
of 2012 with approximately 2000 nodes and 15,000 edges - 
using this smaller subset exaggerates the collisions and their 
effects on the edge probabilities. In panel (a), we show the 
probabilities for the original network, where the probabilities 
are small and unaffected by the fast model. 

To test the limits of the method, in panel (b) we take the 
high degree nodes from the original network and expand them 
such that they have near \sqrt{2M} edges elsewhere in the 
network. Additionally, these high degree nodes are connected to 
each other, meaning they approach the case where D_ii D_jj / 2M ≈ 1. 
Another 10,000 networks are generated from the fast 
model to match this augmented network. We see that the 
randomly inserted edges still follow the predicted slow CL 
value, although the probabilities are slightly higher due to 
the increased degrees. It is only in the far extreme case 
where we connect \sqrt{2M} degree nodes to one another that 
we see a difference in the realized probability from the CL 
probability. These account for 0.05% of edges in the 
augmented network, which has been created specifically to test 
for problem cases. For social networks, it is unlikely for these 
situations to arise. 



4. CORRECTION TO FAST MODEL 

In order to actually generate this graph it is efficient to use 
rejection sampling. Namely, we draw two nodes from π and 
attempt to place an edge between them. If an edge already 
exists, we reject the sample and draw again. In general, as 
we are using sparse graphs we will not have many collisions, 
and so few samples are rejected. 
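
A minimal sketch of this rejection-sampling generator is given below. It assumes the node ID vector described in Section 3.2 and treats self-loops as rejected samples, which the text does not specify; the function names are hypothetical. The loop assumes the degree sequence is sparse enough that M distinct edges exist.

```python
import random


def build_id_vector(degrees):
    """Place each node ID in the vector once per unit of degree (length 2M)."""
    return [node for node, d in enumerate(degrees) for _ in range(d)]


def fast_cl_rejection(degrees, rng):
    """Draw node pairs from pi and reject any pair whose edge already exists."""
    ids = build_id_vector(degrees)
    target = len(ids) // 2  # M edges
    edges = set()
    while len(edges) < target:
        i, j = rng.choice(ids), rng.choice(ids)
        if i == j:
            continue  # assumption: self-loops are rejected
        edge = (min(i, j), max(i, j))
        if edge in edges:
            continue  # collision: reject the sample and draw again
        edges.add(edge)
    return edges


if __name__ == "__main__":
    graph = fast_cl_rejection([2] * 10, random.Random(0))
    print(sorted(graph))
```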

One thing to notice is that the algorithm assumes we can draw 
an edge only once (sampled without replacement), but nodes 
are drawn multiple times (sampled with replacement). How- 
ever, the samples are rejected according to whether or not an 
edge exists. As certain nodes have a higher degree, the prob- 
ability of collision is higher for them, meaning their edges 
are rejected more frequently than low degree nodes. Rejec- 
tion of those node samples means that the nodes have their 
degree under sampled. 

Proposition 2. When repeated samples of the same edge 
are dropped, the nodes of high degree have their degrees 
underestimated. 

Proof. Let v_i, v_j be two nodes in our network such that 
π(i) > π(j), and let v_k be a node attempting to lay an edge 
with another node. Comparing the probability of collision 
on v_i, v_k vs. v_j, v_k gives us: 

\[ P(e_{ik}) = \pi(i)\pi(k) > \pi(j)\pi(k) = P(e_{jk}) \]

\[ P(collision_{ik}) = P(e_{ik})^2 = (\pi(k)\pi(i))^2 > (\pi(k)\pi(j))^2 = P(collision_{jk}) \]

Since more edges are laid by high degree nodes than low, 
this implies it is more likely for high degree nodes such as 
Vi to experience collisions on edge insertions, biasing their 
expected degrees. □ 
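
The inequality in the proof can be illustrated numerically. The sketch below uses a hypothetical degree sequence and a hypothetical helper `collision_probability`, and simply evaluates the squared edge probability for a high degree and a low degree node paired with the same partner.

```python
from fractions import Fraction


def collision_probability(degrees, i, k):
    """Probability that the pair (i, k) is drawn twice: (pi(i) * pi(k)) ** 2."""
    two_m = sum(degrees)
    p_edge = Fraction(degrees[i], two_m) * Fraction(degrees[k], two_m)
    return p_edge ** 2


if __name__ == "__main__":
    degrees = [8, 2, 3, 3]  # node 0 is high degree, node 1 is low degree
    print(collision_probability(degrees, 0, 2))
    print(collision_probability(degrees, 1, 2))
```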

One simple approach to correct this problem is to sample 
2M nodes independently from π. The nodes can then be paired 
together and the pairings checked for duplicates. Should 
any edges be laid more than once across a pair of nodes, the 
entire set of nodes is randomly permuted and rematched. 
This process continues until no duplicate pairings are found. 

The general idea behind this random permutation motivates 
our correction to the fast method. While the random 
permutation of all 2M nodes is somewhat extreme, a method 
which permutes only a few edges in the graph - the ones 
with collisions - is feasible. With this in mind, our solution 
to this problem is straightforward. Should we encounter a 
collision, we place both vertices in a waiting queue. Before 
continuing with regular insertions we will attempt to select 
neighbors for all nodes in the waiting queue. Should the new 
edge for a node in the queue also encounter a collision, the 
chosen neighbor is also placed in the queue, and so forth. 
This ensures that if a node is 'due' for a new edge but is 
prevented from receiving it due to a collision, the node is 
'slightly permuted' by exchanging places with a node sam- 
pled later. 
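
The waiting-queue correction can be sketched as follows. This is an illustrative reading of Algorithm 1, not a reference implementation: self-loops are treated as collisions, the loop runs until M edges are placed, and the `max_attempts` guard is an added safety assumption.

```python
import random
from collections import deque


def corrected_fast_cl(degrees, rng, max_attempts=None):
    """Fast CL with a waiting queue: endpoints of a collided edge are retried first."""
    ids = [node for node, d in enumerate(degrees) for _ in range(d)]
    target = len(ids) // 2  # M edges
    limit = max_attempts if max_attempts is not None else 100 * max(target, 1)
    edges = set()
    queue = deque()
    attempts = 0
    while len(edges) < target and attempts < limit:
        attempts += 1
        # Nodes waiting in the queue take priority over a fresh draw from pi.
        v_j = queue.popleft() if queue else rng.choice(ids)
        v_i = rng.choice(ids)
        edge = (min(v_i, v_j), max(v_i, v_j))
        if v_i != v_j and edge not in edges:
            edges.add(edge)
        else:
            # Collision: both endpoints are 'due' an edge, so queue them.
            queue.append(v_i)
            queue.append(v_j)
    return edges


if __name__ == "__main__":
    graph = corrected_fast_cl([3] * 20, random.Random(0))
    print(len(graph), "edges placed")
```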

This shuffling ensures that M edges are actually placed, which 
is needed by proposition 1, without affecting the degree 
distribution as can happen by proposition 2. Furthermore, as 
we leave an edge if it ever occurs, the probability defined by 
proposition 1 is never lowered for the edges which have multiple 
occurrences, only raised to ensure M edges are placed. 

Our correction to the fast CL model assumes independence 
between the edge placements and the current graph config- 
uration. This independence only truly holds when collisions 
are allowed (i.e. when generating a multigraph). In practice 
edge placements are not truly independent, as we disallow 
the placement of edges that already exist in the graph. The 
correction we have described removes the bias described in 
proposition 2 but is not guaranteed to generate graphs exactly 
according to the original π distribution. The fast version 
of the graph generation algorithm must project from a 
space of multigraphs down into a space of simple graphs, and 
this projection is not necessarily uniform over the space of 
graphs. However, our empirical results show that on sparse 
graphs our correction removes the majority of the bias due 
to collisions and that the bias from the projection is negligible, 
meaning we can treat graphs from the corrected fast 
generation as being drawn from the original Chung-Lu graph 
distribution. While the slow Chung-Lu model is guaranteed 
to produce unbiased π distributed graphs, the fast method 
produces graphs which are nearly indistinguishable from the 
slow method and runs an order of magnitude faster. 

In Figure 2 we can see the effect of the correction on two 
labeled datasets, Epinions and Facebook (described in 
section 7). The green line corresponding to the simple insertion 
technique underestimates the degrees of the high degree 
nodes in both instances. The correction results in having a 
much closer match on the high degree nodes. By utilizing 
this correction, we can generate graphs whose degree 
distributions are unaffected by the possibility of collision and are 
able to generate graphs in O(M). 

5. TRANSITIVE CHUNG-LU MODEL 

A large problem with the Chung-Lu model is the lack of 
transitivity captured by the model. As many social networks 
(among others) are formed via friendships, drawing 
randomly from the distribution of nodes across the network fails 
to capture this property. We propose the Transitive Chung 
Lu model described in Algorithm 2, which has a probability 
of a 'random surfer' connecting two nodes across the network 
but has an additional probability of creating a new 
transitive edge across a pair of nodes connected by a 2 hop 
path. The CL model is now a special case of TCL where 
p = 0 and edge selection is always done through a random 
walk. In the TCL model, we include the transitive edges 
while maintaining the same expected invariant distribution 
as the CL model. Thus the TCL model is guaranteed to 
have an expected degree distribution equal to that of the 
original network. 

Algorithm 2 TCL(π, p, N, |E|, iterations) 
  E^TCL = CL(π, N, |E|) 
  initialize(queue) 
  for iterations do 
    if queue is empty then 
      v_j = pi_sample(π) 
    else 
      v_j = pop(queue) 
    end if 
    r = bernoulli_sample(p) 
    if r = 1 then 
      v_k = uniform_sample(E_j^TCL) 
      v_i = uniform_sample(E_k^TCL) 
    else 
      v_i = pi_sample(π) 
    end if 
    if e_ij ∉ E^TCL then 
      E^TCL = E^TCL ∪ e_ij 
      // remove oldest edge from E^TCL 
      E^TCL = E^TCL \ min(time(E^TCL)) 
    else 
      push(queue, v_i) 
      push(queue, v_j) 
    end if 
  end for 
  return(E^TCL) 

We begin by constructing a graph of M edges using the 
standard Chung-Lu model as described above. This gives 
us an initial edge set E which has the same expected degree 
distribution as the original data. We then initialize a queue 
which will be used to store nodes that have a higher prior- 
ity for receiving an edge. Next, we define an update step 
which replaces the oldest edge in the graph with a new one 
selected according to the TCL model and repeat this process 
for the specified number of iterations. If the priority queue 
is not empty, we will choose the next node in the queue to 
be v_j, the first endpoint of the edge; otherwise, on line 5 
we sample v_j using the π distribution. With probability p 
we will add an edge between v_j and some node v_i through 
transitive closure by choosing an intermediate node v_k 
uniformly from j's neighbors, then selecting v_i uniformly from 
k's neighbors. In contrast, with probability (1 - p) we use 
the 'random surfer' method by randomly choosing v_i from 
the graph according to the invariant distribution π. 

Either method, transitive or random surfer, returns an ad- 
ditional node to the method to use as the other endpoint. 
If the selected edge is not already part of the graph, we will 
add it and remove the oldest edge in the graph (continually 
removing the warmup CL edges). If the selected edge is al- 
ready present in the graph, we place the selected endpoint 
nodes into the priority queue (lines 21 and 22). We repeat 
this replacement operation many times to ensure that the 
original graph is mostly replaced and then return the set of 
edges as the new graph. In practice, we find that M replace- 
ments - enough to remove all edges generated originally by 
CL - is sufficient. 
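
The update loop described above can be sketched in Python. The sketch makes several assumptions not stated in the text: the warm-up CL graph is passed in as `seed_edges` (oldest first), self-loops count as collisions, a node without neighbors falls back to the random surfer, and an attempt limit guards the loop. All names are hypothetical.

```python
import random
from collections import deque


def tcl_generate(degrees, p, rng, seed_edges):
    """Replace every warm-up edge with an edge chosen by the TCL update step."""
    ids = [node for node, d in enumerate(degrees) for _ in range(d)]
    order = deque()   # edges in insertion order, oldest on the left
    neighbors = {}    # node -> set of adjacent nodes

    def add(u, v):
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
        order.append((u, v))

    def remove_oldest():
        u, v = order.popleft()
        neighbors[u].discard(v)
        neighbors[v].discard(u)

    for u, v in seed_edges:
        add(u, v)

    queue = deque()
    target = len(seed_edges)  # M replacements remove all warm-up edges
    replaced = attempts = 0
    while replaced < target and attempts < 100 * max(target, 1):
        attempts += 1
        v_j = queue.popleft() if queue else rng.choice(ids)
        if rng.random() < p and neighbors.get(v_j):
            # Transitive step: walk two hops, uniformly over neighbors each time.
            v_k = rng.choice(sorted(neighbors[v_j]))
            v_i = rng.choice(sorted(neighbors[v_k]))
        else:
            # Random surfer step: draw the endpoint from pi.
            v_i = rng.choice(ids)
        if v_i != v_j and v_i not in neighbors.get(v_j, ()):
            add(v_j, v_i)
            remove_oldest()
            replaced += 1
        else:
            queue.append(v_i)
            queue.append(v_j)
    return list(order)


if __name__ == "__main__":
    ring = [(i, (i + 1) % 10) for i in range(10)]  # stand-in for a CL warm-up graph
    print(tcl_generate([2] * 10, 0.5, random.Random(0), ring))
```

Sorting the neighbor set on every draw keeps the sketch short but costs more than the constant-time sampling the paper assumes; an indexable neighbor list would avoid this.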

In order to show that this update operation preserves the 
expected degree distribution, we prove the following: 



1. From any starting point v_j the probability of a two 
hop walk ending on node v_i is π(i) 

2. The probability that TCL selects an edge e_ij is the 
same as CL selecting e_ij 

3. The change in the expected degree distribution after a 
TCL iteration is zero 

As the graph is initialized to a CL graph that has expected degree 
distribution equal to the original graph, and updates are 
performed that preserve the expected degree distribution, 
the final graph will have the same expected degree 
distribution through induction. Our update step is a stochastic 
combination of two edge insertion operations: one that 
samples an edge using the π distribution as in the standard CL 
model, and one that samples an edge based on 2 hop paths. 
Naturally the CL insertion selects edges based on the π 
distribution by definition. Now we will show that sampling 
an edge using the existing 2 hop paths also selects edges 
according to the π distribution. 

Theorem 1. Starting from any node v_j, if the edges in the 
graph are distributed according to π(k)π(i) and the walker 
traverses two hops by sampling uniformly over the edges 
of v_j and subsequently the selected neighbor v_k of v_j, the 
probability of ending this walk on node v_i is π(i). 

Proof. We can represent the probability of a particular 
path v_j → v_k → v_i existing in the graph as 

\[ P(path_{jki}) = \frac{D_{jj} D_{kk}}{2M} \cdot \frac{D_{kk} D_{ii}}{2M} \]

The probability of following this path in a uniform random 
walk in the CL graph, when it exists, is: 

\[ P(walk_{jki}) = \frac{1}{D^{CL}_{j} D^{CL}_{k}} \qquad (8) \]

To calculate the probability of a walk starting on node j and 
ending at node i, we have to normalize by the probability 
of walking a 2 hop path from node j to any other node i' in 
the graph: 

\[ P(walk_{ji}) = \frac{\sum_{k \in V} P(path_{jki}) \cdot P(walk_{jki})}{\sum_{i' \in V} \sum_{k \in V} P(path_{jki'}) \cdot P(walk_{jki'})} \]

\[ = \frac{\sum_{k \in V} \frac{D_{jj} D_{kk}}{2M} \frac{D_{kk} D_{ii}}{2M} \frac{1}{D^{CL}_{j} D^{CL}_{k}}}{\sum_{i' \in V} \sum_{k \in V} \frac{D_{jj} D_{kk}}{2M} \frac{D_{kk} D_{i'i'}}{2M} \frac{1}{D^{CL}_{j} D^{CL}_{k}}} \]

\[ = \frac{D_{ii} \sum_{k \in V} D_{kk}^2 / D^{CL}_{k}}{\sum_{i' \in V} D_{i'i'} \sum_{k \in V} D_{kk}^2 / D^{CL}_{k}} = \frac{D_{ii}}{2M} = \pi(i) \]

So regardless of the starting node j, the probability of 
landing on i after traveling 2 hops uniformly over the edges is 
π(i). □ 

Utilizing the above theorem, we next show the probability 
of an edge e_ij existing in the graph is π(i)π(j). 

Theorem 2. The Transitive Chung-Lu model selects edge 
e_ij for insertion with probability 

\[ P(e_{ij}) = \pi(i)\pi(j) \qquad (9) \]

Proof. The inductive step randomly selects a node v_j 
from the invariant distribution π to be the first endpoint 
of a new edge. From v_j, we have two options to complete 
the edge: with probability p we use the transitive closure to 
walk 2 hops to find the other endpoint, and with probability 
1 - p we perform a random surf using π. The invariant 
distribution π^TCL can then be written as: 

\[ \pi^{TCL}(i) = \sum_j \pi(j) \left[ p \cdot P(walk_{ji}) + (1 - p) \pi(i) \right] \]

In theorem 1 we showed that P(walk_ji) = π(i). Now the 
probability of selecting edge e_ij can be written as: 

\[ P(e_{ij}) = \pi(j) \cdot \left( p \cdot P(walk_{ji}) + (1 - p) \cdot \pi(i) \right) \]
\[ = \pi(j) \cdot \left( p \cdot \pi(i) + (1 - p) \cdot \pi(i) \right) \]
\[ = \pi(j) \cdot \pi(i) \]

□ 

Therefore, the inductive step of TCL will place the endpoints 
of the new edge according to π. 
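
Theorem 1 can be checked with exact arithmetic on a small degree sequence. The sketch below is a hedged illustration: it assumes every node's realized CL degree equals its expected degree, and the function name `two_hop_landing_probability` is hypothetical.

```python
from fractions import Fraction


def two_hop_landing_probability(degrees, j, i):
    """Normalized probability that a uniform two-hop walk from j ends on i."""
    two_m = sum(degrees)

    def weight(target):
        total = Fraction(0)
        for d_k in degrees:
            # P(path) = (D_j D_k / 2M) * (D_k D_target / 2M)
            path = Fraction(degrees[j] * d_k, two_m) * Fraction(d_k * degrees[target], two_m)
            # P(walk) = 1 / (D_j D_k), assuming realized degree == expected degree
            walk = Fraction(1, degrees[j] * d_k)
            total += path * walk
        return total

    return weight(i) / sum(weight(t) for t in range(len(degrees)))


if __name__ == "__main__":
    degrees = [4, 3, 2, 2, 1]  # hypothetical degree sequence, 2M = 12
    for i in range(len(degrees)):
        print(i, two_hop_landing_probability(degrees, 0, i), Fraction(degrees[i], 12))
```

The result does not depend on the starting node j, matching the statement of the theorem.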



Corollary 1. The expected degree distribution of the graph 
produced by TCL is the same as the degree distribution of 
the input graph. 

Proof. The inductive step of TCL places an edge with 
endpoints distributed according to π, so the expected 
increase in the degree of any node v_i is π(i). However, the 
inductive step will also remove the oldest edge that was placed 
into the network. Since the oldest edge can only have been 
placed in the graph through a Chung-Lu process or a transi- 
tive closure, the expected decrease in the degree is also 7r(i), 
which means the expected change in the degree distribution 
is zero. Because the CL initialization step produces a graph 
with expected degree distribution equal to the input graph's 
distribution, and the TCL update step causes zero expected 
change in the degree distribution the output graph of the 
TCL algorithm has expected degree distribution equal to 
the input graph's distribution by induction. □ 



This means we are placing edges according to π(i)π(j), and 
doing M insertions. This is the same model as shown for 
the fast CL method, which also inserts M edges according 
to π(i)π(j), meaning that if the fast CL method follows slow 
CL, TCL does as well. In practice, TCL and CL capture the 
degree distribution well (section 7). 

5.1 Fitting Transitive Chung Lu 

Now that we have introduced a p parameter which controls 
the proportion of transitive edges in the network we need 
a method for learning this parameter from the original net- 
work. For this, we need to estimate the probability p by 
which edge formation is done by triadic closure, and the 
probability 1 — p by which the random surfer forms edges. 
We can accomplish this estimation using an Expectation 
Maximization algorithm. First, let z_ij ∈ Z be latent 
variables on each e_ij ∈ E with values z_ij ∈ {1, 0}, where 1 
indicates the edge e_ij was laid by a transitive closure and 
0 indicates the edge was laid by a random surfer. Although 
the Z values are unknown we can jointly estimate them with 
p using EM. 

We can now define the conditional probability of placing an 
edge e_ij from starting node v_j given the method z_ij by which 
the edge was placed: 

\[ P(e_{ij} \mid z_{ij} = 1, v_j, p) = p \sum_{v_k \in N(v_j) \cap N(v_i)} \frac{1}{D_j D_k} \]

\[ P(e_{ij} \mid z_{ij} = 0, v_j, p) = (1 - p) \cdot \pi(i) \]

Here N(v) denotes the set of neighbors of v. Given the starting 
node v_j, the probability of the edge existing 
between v_i and v_j, given that the edge was placed due 
to a triangle closure, is p times the probability of walking 
from v_j to a mutual neighbor of i and j and then continuing 
the walk on to i, while 1 - p is the probability the edge was 
placed by a random surfer. We now show the EM algorithm. 

Expectation 

Note that the conditional probability of z_ij, given the edge 
e_ij and p, can be defined in terms of the probability of an 
edge being selected by the triangle closure divided by the 
probability of the edge being laid by any method. Using 
Bayes' Rule, our conditional distribution on Z is simply: 

\[ P(z_{ij} = 1 \mid e_{ij}, v_j, p) = \frac{p \sum_{v_k \in N(v_j) \cap N(v_i)} \frac{1}{D_j D_k}}{p \sum_{v_k \in N(v_j) \cap N(v_i)} \frac{1}{D_j D_k} + (1 - p)\,\pi(i)} \]

And our expectation of z_ij is 

\[ E[z_{ij} \mid p] = P(z_{ij} = 1 \mid e_{ij}, v_j, p) \]

Figure 3: Convergences of the EM algorithm - both 
in terms of time and number of iterations. 10000 
samples per iteration. 

Maximization 

To maximize this expectation, we note that p is a Bernoulli 
variable representing P(z_ij = 1). We sample a set of edges S 
uniformly from the graph to use as evidence when updating 
p. The variables z_ij are conditionally independent given the 
edges and nodes in the graph, meaning the MLE update to 
p is then calculating the expectation of z_ij ∈ S and then 
normalizing over the number of edges in S: 

\[ p = \frac{1}{|S|} \sum_{e_{ij} \in S} E[z_{ij} \mid p] \]



We sample these edge subsets uniformly from the set of all
edges. This can be done quickly using the node ID vector we
constructed for sampling from the $\pi$ distribution. As any node
$i$ appears $D_i$ times in this vector, sampling a node from the
vector and then uniformly sampling one of its edges gives a
$\frac{D_i}{2M} \cdot \frac{1}{D_i} = \frac{1}{2M}$ probability of sampling any given edge. We
gathered subsets of 10000 edges per iteration, and our EM
algorithm converges in just a few seconds, even on datasets
with millions of edges. Figure 3 shows the convergence time
on each of the datasets.
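
The expectation, maximization and edge-sampling steps described above can be put together in a short Python sketch. This is a minimal illustration under stated assumptions rather than the authors' code: the names `learn_p` and `posterior_transitive`, the adjacency-list input, the starting value `p0`, and the use of $\pi(i) = D_i / 2M$ are all choices made for this example.

```python
import random


def posterior_transitive(adj, nbr_sets, i, j, p, two_m):
    """E-step for one edge: P(z_ij = 1 | e_ij, v_j, p) via Bayes' rule."""
    d_j = len(adj[j])
    walk = sum(1.0 / (d_j * len(adj[k])) for k in adj[j] if i in nbr_sets[k])
    trans = p * walk                             # laid by a transitive closure
    surf = (1.0 - p) * len(adj[i]) / two_m       # laid by a random surfer
    denom = trans + surf
    return trans / denom if denom > 0 else 0.0


def learn_p(adj, p0=0.5, iterations=10, samples=10000, seed=0):
    """Estimate p by EM; adj maps each node to a list of its neighbors."""
    rng = random.Random(seed)
    nbr_sets = {v: set(nbrs) for v, nbrs in adj.items()}
    # Node ID vector: node v appears degree(v) times, so its length is 2M.
    node_vec = [v for v, nbrs in adj.items() for _ in nbrs]
    two_m = len(node_vec)
    p = p0
    for _ in range(iterations):
        total = 0.0
        for _ in range(samples):
            j = rng.choice(node_vec)   # endpoint drawn proportionally to degree
            i = rng.choice(adj[j])     # then one of its edges uniformly
            total += posterior_transitive(adj, nbr_sets, i, j, p, two_m)
        p = total / samples            # M-step: average expected z_ij
    return p


# A star graph has no triangles, so no edge can be explained by a closure.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(learn_p(star, samples=100))
```

Each iteration draws a fresh sample set, which mirrors the independent per-iteration samples the text describes.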

6. TIME COMPLEXITY 

Both generating a new network and learning the parameter $p$
can be done efficiently.



First, we need to bound the expected number of attempts
to insert an edge into the graph. Note that a node $v_i$ with
$c$ edges has probability $\pi(i) = \frac{c}{2M}$ of hitting its own edge
on the draw from $\pi$; by extension, the probability of hitting
its own edges $k$ times is $\pi(i)^k$. This represents a geometric
distribution, with the expected value of hits $H$ on the
edges of the node being:

$$E[H \mid \pi(i)] = (1 - \pi(i)) \sum_{k=1}^{\infty} k \, \pi(i)^{k-1} = \frac{1}{1 - \pi(i)}$$

This shows the expected number of attempts to insert an
edge is bounded by a constant. As a result, we can generate
the graph in $O(N + M)$, the same complexity as Chung
Lu. Initializing our vector of node IDs and running the
basic CL model takes $O(N + M)$. Next, we need to generate
$M$ insertions while gradually removing the current edges.
This can be seen in lines 3-24 of Algorithm 2.
In this loop, the longest operations are selecting randomly
from neighbors and removing an edge. The cost of both
operations depends on the maximum degree of the network,
which we assumed to be bounded, meaning those operations can
be done in $O(1)$ time. As a result, the total runtime of graph
generation is $O(N + M)$.
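
The constant bound on insertion attempts can be sanity-checked numerically. The sketch below assumes the expectation takes the standard geometric form $(1 - \pi(i)) \sum_{k \ge 1} k \, \pi(i)^{k-1}$, whose closed form is $1 / (1 - \pi(i))$; the function name `expected_attempts` and the truncation at a finite number of terms are choices made for this example.

```python
def expected_attempts(pi_i, terms=10000):
    """Truncated series (1 - pi) * sum_{k>=1} k * pi^(k-1)."""
    return (1.0 - pi_i) * sum(k * pi_i ** (k - 1) for k in range(1, terms + 1))


for pi_i in (0.001, 0.01, 0.5):
    series = expected_attempts(pi_i)
    closed_form = 1.0 / (1.0 - pi_i)
    print(pi_i, series, closed_form)
```

Because $\pi(i)$ is tiny for any node in a large graph with bounded degree, the expected number of attempts stays very close to 1.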

For the learning algorithm, assume we have $I$ iterations
which each gather $s$ samples. It is $O(1)$ to draw a node from the
graph and $O(1)$ to choose a neighbor, meaning each iteration
costs $O(s)$. Coupled with the cost of creating the initial $\pi$
sampling vector, the total runtime is then $O(N + M + I \cdot s)$.

7. EXPERIMENTS 

For our experiments, we compared three different graph
generating models. The first is the fast Chung Lu (CL)
generation algorithm with our correction for the degree
distribution. The second is the Kronecker Product Graph Model
(KPGM), implemented with code taken from the SNAP
library¹ calculated by the authors [t]. Lastly, we compared the
Transitive Chung Lu (TCL) method presented in this paper,
using the EM technique to estimate the $p$ parameter. All
experiments were performed in Python on a Macbook Pro,
aside from the KPGM parameters, which were generated on
a desktop computer using C++². All of these networks were
made undirected by reflecting the edges in the network,
except for the Facebook network, which is already undirected.

7.1 Datasets 

To empirically evaluate the models, we learned model pa- 
rameters from real-world graphs and then generated new 
graphs using those parameters. We then compared the net- 
work statistics of the generated graphs with those of the 
original networks. The four networks used are all large social
networks, and their node and edge counts can be found
in Figure 4a.

The first dataset we analyze is the Epinions dataset.
This network represents the users of Epinions, a website
which encourages users to indicate other users whose
consumer product reviews they 'trust'. The reviews of all users
on a product are then weighted to incorporate both the
reviewer ratings and the amount of trust received from other
users. The edge set of this network represents nominations
of trustworthy individuals between the users.

¹ SNAP: Stanford Network Analysis Project. Available at
http://snap.stanford.edu/snap/index.html
² SNAP is written in C++.

Next, we study the collection of Facebook friendships from 
the Purdue University Facebook network. In this network, 
the users can add each other to their lists of friends and so 
the edge set represents a friendship network. This network 
has been collected over a series of snapshots for the past
4 years; we use nodes and friendships aggregated across all 
snapshots. 

The Gnutella30 network is of a different type than the other
networks presented. Gnutella is a Peer2Peer network where
users are attempting to find seeds for file sharing [12]. The
user reaches out to its current peers, querying if they have a
file. If not, the peer refers the user to other users who might
have the file, repeating this process until a seed user can be
found. Because this network represents the structure of a
file sharing program rather than true social interactions, it
has significantly less clustering than the other networks.

Lastly, we study a collection of emails gathered from the
SMTP logs of Purdue University [1]. This dataset has an
edge between users who sent e-mail to each other. The
mailing network has a small set of nodes which sent out mail at
a vastly greater rate than normal nodes; these nodes were
most likely mailing lists or automatic mailing systems. To
correct for these 'spammer' nodes, we remove nodes
with a degree greater than 1,000, as these nodes did not
represent participants in any kind of social interaction. The
network has over two hundred thousand nodes and nearly
two million edges (Figure 4a).
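
The two preprocessing steps mentioned for these datasets, reflecting edges to obtain an undirected network and removing very high degree nodes, can be sketched as follows. The function names `to_undirected` and `drop_high_degree`, the adjacency-set representation, and the decision to measure degree after reflection and to keep nodes left without neighbors are assumptions for illustration; the source only states the 1,000-degree threshold used for the email network.

```python
def to_undirected(edges):
    """Reflect every directed edge (u, v) so that both u-v and v-u exist."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def drop_high_degree(adj, max_degree=1000):
    """Remove nodes whose degree exceeds max_degree, plus their edges."""
    heavy = {v for v, nbrs in adj.items() if len(nbrs) > max_degree}
    return {v: nbrs - heavy for v, nbrs in adj.items() if v not in heavy}


# Tiny example with a lowered threshold so the filter is visible.
edges = [(0, 1), (1, 0), (0, 2), (0, 3), (2, 3)]
graph = drop_high_degree(to_undirected(edges), max_degree=2)
print(graph)
```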

7.2 Running Time 

In Figure 3 we can see the convergence of the EM algorithm
when learning parameter $p$, both in terms of the number of
iterations and in terms of the total clock runtime. Due to
the independent sample sets used for each iteration of the
algorithm, we can estimate whether the sample set in each
iteration is sufficiently large. If the sample size is too small,
the algorithm will be susceptible to variance in the samples
and will not converge. Using Figure 3a we can see that
after 5 iterations of 10,000 samples each our EM method
has converged to a smooth line.

In addition to the convergence in terms of iterations, in
Figure 3b we plot the wall time against the current estimated
$p$. The gap between 0 and the start of the colored lines
indicates the amount of overhead needed to generate our
degree distribution statistic and $\pi$ sampling vector for the
given graph (a step also needed by CL). The Purdue Email
network has the longest learning time at 3 seconds. For the
same Email network, learning the KPGM parameters took
approximately 2 hours and 15 minutes, so our TCL model
can learn parameters from a network significantly faster than
the KPGM model.

Next, we test graph generation speed, shown in Figure 4c.
The maximum time taken to
generate a graph by CL is 61 seconds for the Purdue Email
dataset, compared to 141 seconds to generate via TCL. Since
TCL must initialize the graph using CL and then lay its own
edges, it is logical that TCL requires at least twice as long as
CL. The runtimes indicate that the transitive closures cost
little more in terms of generation time compared to the CL
edge insertions. KPGM took 285 seconds to generate the
same network. The discrepancy between KPGM and TCL
is the result of the theoretical bounds of each: KPGM takes
$O(M \log N)$ while TCL takes $O(M)$.

Figure 4: Dataset sizes, along with learning times and running times for each algorithm.

7.3 Graph Statistics 

So far, the CL model has shown superiority in terms of learn- 
ing and runtime to both TCL and KPGM, while TCL has 
distanced itself from KPGM in the same measures. How- 
ever, the ability to learn and generate large graphs quickly 
is only a portion of the task, as generating a network with 
little or no resemblance to the given network does not meet 
the primary goal of modeling the network. 

In order to test the ability of the models to generate networks 
with similar characteristics to the original 4 networks, we 
compare them on three well known graph statistics: the 
degree distribution, the clustering coefficient and the hop 
plot. 

Matching the degree distribution is the goal of both the CL
and KPGM models, as well as the new TCL algorithm. In
the left hand column of Figure 5, the degree distributions of
the networks generated from each model for each real-world
network are shown, compared against the original real-world
networks' degree distributions. The measure used along the y-axis is
the complementary cumulative degree distribution (CCDF),
while the x-axis plots the degree, meaning the y-value at
a point indicates the percentage of nodes with greater
degree. The 4 networks have degree distributions of varying
styles: the 3 social networks (Epinions, Facebook, and
PurdueEmail) have curved degree distributions, compared to
Gnutella30 whose degree distribution is nearly straight,
indicating an exponential cutoff. As theorized, both CL
and TCL have a degree distribution which closely matches
their expected degree distribution, regardless of the
distribution shape. KPGM best matches the Gnutella30 network,
sharing an exponential cutoff indicated by a straight line,
but is still separated from the original network's
distribution. With the social networks, KPGM has an alternating
dip/flat line pattern which does not resemble the true degree
distribution. In contrast, TCL matches the distributions of
all 4 networks with the same accuracy as the CL method,
showing the model continues to match the degree
distribution well even with the addition of transitive closures.
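
For readers who want to reproduce the y-axis measure, the following Python sketch computes a degree CCDF as described above: for each observed degree, the fraction of nodes with strictly greater degree. The function name `degree_ccdf` and the dictionary output format are choices made for this example.

```python
from collections import Counter


def degree_ccdf(degrees):
    """Map each observed degree d to the fraction of nodes with degree > d."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n
    ccdf = {}
    for d in sorted(counts):
        remaining -= counts[d]      # nodes left are those with degree > d
        ccdf[d] = remaining / n
    return ccdf


print(degree_ccdf([1, 1, 2, 3]))
```

Plotting these pairs on log-log axes gives the style of curve the text compares across models.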

The next statistic we examine is TCL's ability to model
clustering, as neither CL nor KPGM attempts to replicate the
clustering found in social networks. As with the degree, we
plot the CCDF on the y-axis, but against the local clustering
coefficient on the x-axis. The clustering coefficient is a
measure comparing the number of triangles in the network vs.
the possible number of triangles in the network, and a higher
value indicates more clustering. On the network with
the largest amount of clustering, Epinions, TCL matches the
distribution of clustering coefficients well, with the TCL
distribution lying on top of the original distribution. The same
follows for Facebook and PurdueEmail, despite the large size
of the latter. The Gnutella30 network has a remarkably low amount
of clustering, so low that it is plotted in log-log scale,
yet TCL is able to follow the distribution as well.
Furthermore, the networks exhibit a range of $p$ values, but the TCL
EM estimation is able to accurately capture the clustering
behavior of the original network.

In contrast, CL and KPGM cannot model the clustering
distribution. For each network, both methods lack appreciable
amounts of clustering in their generated graphs, even
undercutting the Gnutella30 network, which has far less
clustering than the others. This shows a key weakness of
both models, as clustering is an important characteristic
of small-world networks.
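
The local clustering coefficient used in this comparison can be computed per node as the number of edges among its neighbors divided by the number of possible such edges. The sketch below is one common way to do this; the function name `local_clustering`, the adjacency-set input, and the convention of assigning 0 to nodes with fewer than two neighbors are assumptions for this example.

```python
def local_clustering(adj):
    """Per-node clustering: closed neighbor pairs / possible neighbor pairs.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    coeffs = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            coeffs[v] = 0.0          # convention: no pairs, no clustering
            continue
        # Each edge between two neighbors is seen twice, once per endpoint.
        links = sum(1 for u in nbrs for w in adj[u] if w in nbrs) // 2
        coeffs[v] = 2.0 * links / (d * (d - 1))
    return coeffs


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(adj))
```

The resulting values can be passed through the same CCDF construction used for degrees to obtain the clustering curves.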

The last measure examined is the Hop Plot, in the right
column of Figure 5. The Hop Plot indicates how tightly
connected the graph is; for each x-value, the y-value
corresponds to the percentage of nodes that are reachable within
that many hops. When generating the hop plots, we
excluded any nodes with infinite hop distance and discarded
disconnected components and orphaned nodes. All of the
models followed the hop plots well, with TCL producing
hop plots very close to the standard CL. This indicates that
the transitive closures of TCL did not impact the
connectivity of the graph and the gains in terms of clustering can
be obtained without altering the hop plot.
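
A hop plot of this kind can be computed exactly on small graphs with a breadth-first search from every node. The sketch below reports, for each hop count, the fraction of reachable ordered node pairs within that many hops, skipping unreachable pairs; this pair-based normalization, the function name `hop_plot`, and the exhaustive all-sources search (which is too slow for graphs with millions of edges) are assumptions for illustration rather than the procedure used in the experiments.

```python
from collections import deque


def hop_plot(adj):
    """Map hop count h to the fraction of reachable pairs within h hops."""
    pair_counts = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                          # standard BFS from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for hops in dist.values():
            if hops > 0:                      # ignore the source itself
                pair_counts[hops] = pair_counts.get(hops, 0) + 1
    total = sum(pair_counts.values())
    plot, running = {}, 0
    for hops in sorted(pair_counts):
        running += pair_counts[hops]
        plot[hops] = running / total
    return plot


# Path graph 0 - 1 - 2.
print(hop_plot({0: {1}, 1: {0, 2}, 2: {1}}))
```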

8. CONCLUSIONS 

In this paper we demonstrated a correction to the Chung 
Lu fast estimation algorithm and introduced the Transitive 
Chung Lu model. Given a real-world network, the TCL 
model learns and generates a graph which accurately
captures the degree distribution, clustering coefficient
distribution and hop plot found in the training network. We proved
the algorithm generates a network in $O(M)$, on the order
of CL and faster than KPGM. The amount of clustering
in the generated network is controlled by a single parame- 
ter, and we demonstrated how estimating the parameter is 
several orders of magnitude faster than estimating KPGM. 
The networks generated by our TCL algorithm exhibit char- 
acteristics of the original network, including degree distri- 
bution and clustering, unlike the graphs generated by CL 
and KPGM. Future directions for these results are numerous,
including analysis of networks over time and methods
which explore extrapolating a larger graph from a given
graph. Lastly, while our analysis has TCL generating
networks which match the degree distributions and clustering
of a real-world network, usage of a transitivity parameter
for clustering is still a heuristic approach. A more formal
analysis of the clustering expected from such a model would
be worth pursuing.

Figure 5: Degree distribution, clustering and hop plots for the Epinions, Facebook, Gnutella30 and PurdueEmail
datasets.